Spring R ’24
Dominic Bordelon, Research Data Librarian
University Library System, University of Pittsburgh
dbordelon@pitt.edu
Services for the Pitt community:
Support areas and interests:
| # | Date | Title |
|---|---|---|
| 1 ⭐ | 2/22 | Getting Started with Tabular Data |
| 2 | 2/29 | Working with Data Frames |
| 3 | 3/7 | Data Visualization |
| 4 | 3/21 | Inference and Modeling Intro |
| 5 | 3/28 | Machine Learning Intro |
Converting from XML or unstructured data to tabular format is popular, because it facilitates statistical analysis.
In this course, we are working with string data—interpreted as numeric values and pieces of text—in a tabular format.
R is…
RStudio (Desktop) is…
Here is a (very small) sampling of papers that use analytic methods which are available in R.
Results: Two analyses generated clusters with high concentrations of PARS cases. The first analysis (N= 136; PARS= 34) revealed a cluster containing 83% PARS cases, in which the patients showed a significant discrepancy between verbal and performance intelligence…. The second analysis (N= 123; PARS= 30) revealed a cluster containing 71% PARS cases, of which 93% were females; the mean age of onset of psychosis, at 17.2, was significantly early.
Conclusions: These results strengthen the evidence that PARS cases differ from other patients with schizophrenia….These findings provide a rationale for separating these phenotypes from others in future clinical, genetic and pathophysiologic studies of schizophrenia and in considering responses to treatment. (Lee et al. 2011)
versus Excel:
✅ Non-proprietary, open source
✅ Powerful and fast interactions with data
✅ Very extensible
✅ Research-oriented community
✅ Reproducible and visible interactions with data
✅ Data viz makes sense to me
✅ Can handle more data for a given quantity of system resources
✅ Less prone to accidental user error
❌ R has a steeper learning curve
❌ R/RStudio doesn’t have convenient data entry
versus Python:
✅ Purpose-built for stats
✅ Simpler mental model and syntax (for tabular data work)
✅ RStudio is better than (free) IDEs for Python
✅ I can always call Python from within R if I need to
❌ R has a smaller (but more focused) community with less published code
+ - * / ^ (exponentiation) %% (modulus) %/% (integer division)
sum(), mean(), median(), min(), max(), sd(), sqrt(), abs()
mode() (note: in R this reports how a value is stored, not the most frequent value)
log(x) for natural log and log10(x) for base 10 (or log(x, base) for any base you want)
exp(1), where 1 is the desired exponent of \(e\).
round() for decimal places, signif() for specifying significant digits
floor(), ceiling()
c() function (“combine”)
▟ TL: Source/Editor 📝
Write scripts and R notebooks in tabs
▜ BL: Console 👩💻
Run commands
⬆ for command history
⭾ tab key for suggestions
▙ TR: Environment, History 🌐
Objects in your workspace (session); Import Dataset
▛ BR: Help, Files, Packages ❓🔍📦
All extremely useful!
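As a minimal sketch of what can be typed into the Console, the lines below try out the arithmetic operators and functions listed earlier. The numbers in `x` are made up for illustration.

```r
# Arithmetic operators
7 + 2      # addition
7 %% 2     # modulus (remainder): 1
7 %/% 2    # integer division: 3
2 ^ 10     # exponentiation: 1024

# c() combines values into a vector; these numbers are made up
x <- c(4, 8, 15, 16, 23, 42)

sum(x)     # 108
mean(x)    # 18
median(x)  # 15.5
sd(x)      # sample standard deviation

log(100, base = 10)   # same result as log10(100)
exp(1)                # e, about 2.718
round(3.14159, 2)     # 3.14
signif(3.14159, 2)    # 3.1
floor(2.7)            # 2
ceiling(2.2)          # 3

# mode() reports how a value is stored, not the most frequent value
mode(x)    # "numeric"
```

Each line can be run on its own; results print directly below the command in the Console.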
R Notebook (.Rmd) or Quarto document (.qmd): mix formatted text and code and code outputs
R script (.R): plain-text file that can be executed by R directly
R Project (.Rproj): lives in the directory for a given project, and holds information like command history and settings. Optional but recommended.
.RData: a workspace (session) snapshot
.rds: an R data structure, i.e., an R object which has been saved to the filesystem
Of course, you will also be loading files in whatever format your data take (spreadsheets, shapefiles, etc.).
Protip: make sure your operating system is set to display all file extensions!
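To illustrate the .rds format, here is a small sketch that saves one object to disk and reads it back with base R's saveRDS() and readRDS(). The values and the temporary file path are placeholders, not part of the course data.

```r
# A small made-up vector to save; any R object works the same way
my_values <- c(2.5, 3.1, 4.8)

# tempfile() gives a throwaway path for this sketch;
# in a project you would choose your own file name ending in .rds
rds_path <- tempfile(fileext = ".rds")

saveRDS(my_values, rds_path)     # write one object to disk
restored <- readRDS(rds_path)    # read it back, under any name you like

identical(my_values, restored)   # TRUE
```

By contrast, save.image() writes every object in the workspace to a single .RData file.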
Keyboard shortcuts
Windows:
Mac:
data, and a file called patients.csv.
Absolute path: /users/djb190/Documents/projects/R/study-x/data/patients.csv
Relative path: data/patients.csv
Use getwd() to check your current working directory
Or if you like to write code: install.packages("name-of-package")
Let’s install the tidyverse, a collection of packages that we’ll use for the rest of the course: install.packages("tidyverse")
library(package-name)
Now let’s attach tidyverse: library(tidyverse)
You are likely to encounter tabular data in the following storage formats:
.csv, .tsv, .dat, .txt
.xlsx, .xls
.ods
.parquet
We are going to focus on CSV, since it is a non-proprietary and extremely common format.
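To see why CSV counts as non-proprietary, it may help to look at one as plain text. The sketch below writes a tiny made-up file to a temporary folder and reads the raw lines back; the column names and values are invented for illustration.

```r
# Hypothetical data: a header row plus three made-up records
csv_path <- file.path(tempdir(), "patients.csv")
writeLines(
  c("id,age,group",
    "1,34,control",
    "2,51,treatment",
    "3,47,control"),
  csv_path
)

# A CSV is just lines of text, with values separated by commas
raw_lines <- readLines(csv_path)
raw_lines
```

Any text editor can open such a file, which is part of what makes the format easy to share.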
library(readr) will attach readr, but it is included in tidyverse (which we already have attached)
read_csv() function!
readr::read_csv().
In order to do something with our data, besides look at them once, we need to tell R to assign the result of our expression (i.e., the output of read_csv()) to an object. We also sometimes call this storing or saving an object.
We use a left-pointing arrow, <- (type less-than and hyphen) for assignment:
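A minimal sketch of the difference assignment makes, assuming the readr package is installed. The file contents and the object name `patients` are made up here, and the file is written to a temporary folder so the example runs anywhere.

```r
library(readr)

# Hypothetical data in a temporary folder; with a real project
# you would pass a path such as "data/patients.csv" instead
csv_path <- file.path(tempdir(), "patients.csv")
writeLines(c("id,age,group", "1,34,control", "2,51,treatment", "3,47,control"), csv_path)

# Without assignment, the result is printed once and then discarded
read_csv(csv_path, show_col_types = FALSE)

# With assignment, the result is stored in an object named `patients`
patients <- read_csv(csv_path, show_col_types = FALSE)

# The stored object can be reused as often as needed
nrow(patients)
names(patients)
```

After the assignment runs, `patients` should appear in the Environment pane.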
Keyboard shortcuts
Windows:
Alt - (alt-hyphen) inserts an assignment operator, <-
Mac:
⌥ - (option-hyphen) inserts an assignment operator, <-
You may also use = (equals) for object assignment, although it is not recommended.
Use the View() function on our loaded data to launch the Data Viewer, for example: View(my_values). This is the same as clicking the object’s name in the Environment pane.
You can also type the object’s name to see a brief textual representation of it, in the console or notebook.
A few more useful functions:
dplyr::glimpse() or str() shows all columns listed top-to-bottom
head() and tail() show the top or bottom of the data frame
summary() summarizes a vector (or each vector in a data frame) according to its data type
Every value in R has one of these types:
numeric: real, decimal numbers
integer: whole numbers
character: text; should always be in quotation marks " " in code
logical: TRUE and FALSE, also called Boolean
complex: for imaginary values, i.e., complex numbers (rare)
raw: values are stored as bytes and not human-readable (rare)
Regardless of type, every value is organized into a structure (usually with other values). These are the most common structures:
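As a rough illustration of types and of one structure used throughout this session, the data frame, the sketch below checks a few values with class() and then inspects a small data frame with the functions mentioned above. All values and column names are made up, and only base R is used.

```r
# class() reports the type names used above
class(3.14)     # "numeric"
class(2L)       # "integer" (the L suffix marks a whole number)
class("text")   # "character"
class(TRUE)     # "logical"

# A made-up data frame: each column is a vector holding a single type
patients <- data.frame(
  id       = c(1L, 2L, 3L),
  age      = c(34.5, 51.0, 47.2),
  group    = c("control", "treatment", "control"),
  enrolled = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keeps text as character in older R versions too
)

str(patients)        # all columns listed top-to-bottom, with their types
head(patients, 2)    # first two rows
tail(patients, 1)    # last row
summary(patients)    # per-column summary, chosen according to each type
```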
data_frame$var_name
write_csv()
write_csv(data_frame, "patients.csv")
The data frame is the “R-native” representation of the data. We read and write to an interchange format (CSV) to save and/or share our work.
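Here is a short sketch of that round trip, assuming readr is installed. The data frame, its column names, and the temporary output path are placeholders for illustration.

```r
library(readr)

# Made-up data frame standing in for real data
patients <- data.frame(
  id  = c(1, 2, 3),
  age = c(34, 51, 47)
)

# data_frame$var_name pulls one column out as a vector
patients$age
mean(patients$age)    # 44

# write_csv() takes the data frame first, then the destination file
out_path <- file.path(tempdir(), "patients.csv")   # temporary location for this sketch
write_csv(patients, out_path)

# Reading the file back should give the same values
patients_again <- read_csv(out_path, show_col_types = FALSE)
all.equal(patients$age, patients_again$age)
```

Writing to a new file name, rather than over the original data file, keeps the raw data intact.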
We learned about:
Next time: exploring data frames!
R 1: Getting Started with Tabular Data